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Abstract 

We consider the implementation of a parallel Monte 
Carlo code for high-performance simulations on PC clus- 
ters with MPL We carry out tests of speedup and efficiency. 
The code is used for numerical simulations of pure SU{2) 
lattice gauge theory at very large lattice volumes, in order 
to study the infrared behavior of gluon and ghost propaga- 
tors. This problem is directly related to the confinement of 
quarks and gluons in the physics of strong interactions. 



1. Introduction 

The strong force is one of the four fundamental interac- 
tions of nature (along with gravity, electromagnetism and 
the weak force). It is the force that holds together protons 
and neutrons in the nucleus. The strong interaction is de- 
scribed by Quantum Chromodynamics (QCD), a quantum 
field theory with local SU{3) gauge invariance |27|. QCD 
states that a baryon (e.g. a proton or a neutron) is not an el- 
ementary particle but is instead made up of building blocks 
called quarks, interacting through the exchange of mass- 
less particles called gluons (equivalent to the photons in the 
electromagnetic interaction). A unique feature of the strong 
force is that the particles that feel it directly — quarks and 
gluons — are completely hidden from us, i.e. they are never 
observed as free particles. This property is known as con- 
finement and makes QCD much harder to handle theoret- 
ically than the theories describing the weak and electro- 
magnetic forces. Indeed, it is not possible to study the con- 
finement problem analytically and physicists must therefore 
rely on numerical simulations performed on supercomput- 
ers. These studies are done using the lattice formulation 



of QCD 1281 . which is based on field quantization through 
Feynman integrals and discretization of space-time on a 
four-dimensional (Euclidean) lattice. In this formulation — 
introduced by Wilson in 1974 |32| — the theory becomes 
equivalent to a model in statistical mechanics and can be 
studied numerically by Monte Carlo simulations 1221 . After 
over two decades |7| of developments in the methodology 
for the numerical study of QCD and with present-day com- 
puters in the teraflops range, lattice-QCD simulations are 
now able to provide quantitative predictions with errors of 
a few percent. This means that these simulations will soon 
become the main source of theoretical results for compari- 
son with experiments in high-energy physics 1 15 1, enabling 
a much more complete understanding of the physics of the 
strong force. 

The study of lattice QCD constitutes a Grand Chal- 
lenge computational problem |14|. Consequently, lattice- 
QCD physicists are natural users of high-performance com- 
puting and have contributed to the development of super- 
computer technology itself. In fact, several research groups 
have built QCD-dedicated computers, using parallel archi- 
tecture. Examples are the Hitachi/CP-PACS machine 
at the University of Tsukuba in Japan fT\}, the QCDSP 
and QCDOC machines at Columbia University in the USA 
[26, 5 1, and the APE machines 01 various research 
centers in Italy and Germany. These computers range from 
about 1 to 10 teraflops. In addition to these large projects, 
many groups base their simulations on clusters of work- 
stations or personal computers (PC's) |25|, since costs are 
much lower and maintenance is simpler. 

In Brazil, the first PC cluster dedicated to lattice-QCD 
studies was installed in 2001 at the Physics Department of 
the University of Sao Paulo in Sao Carlos (IFSC-USP), 
as part of a FAPESP project. The group is currently in- 



vestigating the behavior of gluon and ghost propagators in 
Landau gauge fl3"'4l, with the goal of verifying Gribov's 
proposed mechanism for quark confinement 1 19 33 1. This 
study requires careful consideration of the infrared behav- 
ior of these propagators, i.e. their behavior at small momen- 
tum p (typically p ^ 1 GeV). Since the smallest non-zero 
momentum that can be considered on a lattice is given by 
Pmin ~ 2tt/L — wherc L is the size of the lattice in phys- 
ical units — it is clear that one needs to simulate at very 
large lattice sizes in order to probe the small-momentum 
limit. For example, to have Pmm ~ 0.06 GeV with a lat- 
tice spacing of about 0.17 fm (i.e. physically relevant values 
for small momentum and fine enough lattice'), one needs 
to simulate on a lattice with 140 sites in each direction. 
This is considerably more than what can be currently done 
in QCD simulations. On the other hand, Gribov's predic- 
tions are also valid for simpler cases of lattice gauge theo- 
ries 1331 . such as three-dimensional two-color QCD with in- 
finitely massive quarks, i.e. (pure) SU (2) lattice gauge the- 
ory in three dimensions. This corresponds to considering 
the SUi2) [instead of the SU{3)] gauge group |31|, tak- 
ing three (instead of the usual four) space-time dimensions 
and making the so-called quenched approximation 1 28 1 . In 
this case it was possible to simulate on 140"^ lattices and 
to see clear evidence of Gribov's predicted behavior for the 
gluon propagator 1 1 2 1 . This represents the largest number of 
points per direction ever considered in lattice-gauge-theory 
simulations. The study is currently being extended to even 
larger lattices (up to 260'^) aiming at a more quantitative un- 
derstanding of Gribov's confinement scenario. The consid- 
eration of very large lattice sizes requires parallelization and 
high efficiency of the code in order to obtain good statistics 
in the Monte Carlo simulation. Thus, an optimized paral- 
lel code is of great importance. 

The purpose of the present paper is to describe the imple- 
mentation of the code used in the study above, including a 
discussion of speedup (at fixed and variable volume) and ef- 
ficiency. The algorithms for these simulations and their par- 
allelization are briefly reviewed in Section |2l Our PC clus- 
ter is described in Section|3] together with the performance 
of the code. Finally, in Section |4] we comment on the re- 
sults and report our conclusions. 



are computed as weighted averages over the configurations, 
with the weight function above. For a generic observable 
0[U] this average is defined as 



-s[u] 



El 



-S[U] 



(1) 



In our case is given by the gluon field U^{x), which is 
a matrix defined on each site x and for each direction /i of 
the lattice. The observables we consider are the gluon and 
ghost propagators. 

The standard Wilson action for SU (2) lattice gauge the- 
ory in d dimensions is 1 32 1 



1 - ^TrP^. 



(2) 



where the plaquette P^^ is given by the product of the gluon 
fields [x) around a closed 1x1 loop: 

P^, = U^{x) U,{x + e^) U-\x + e,) U-\x) . (3) 

Here, U^{x) are SU (2) matrices, x (with coordinates a;^ = 
1, 2, . . . , Nf^) are sites on a d-dimensional lattice with peri- 
odic boundary conditions and is a unit vector in the pos- 
itive /i direction. The parameter (3 controls the proximity to 
the continuum limit. The action in eq. (|2} is invariant un- 
der the so-called local gauge transformation 

U^{x) ^ Uls){x) = g{x) U^ix) g~\x + e^) , (4) 

where g{x) are general SU{2) matrices. Indeed, gauge 
theories are systems with redundant dynamical variables, 
which do not represent true dynamical degrees of freedom. 
This implies that the objects of interest are not the gluon 
fields U^{x) themselves, but rather the classes (orbits) of 
gauge-related fields ujf^x). The elimination of such re- 
dundant gauge degrees of freedom is often essential for un- 
derstanding and extracting physical information from these 
theories. This is usually done by a method called gauge 
fixing, in which a unique representative is chosen on each 
gauge orbit 1 18 1. 

For an SU (2) matrix we shall use the parametrization 



2. The algorithms 

Lattice field theories are defined^ by a functional of the 
fields — the action S[U] — which determines the (unrenor- 
malized) statistical weig ht e-^I'^l of a given field config- 
uration {[/}. All quantities of interest, called observables, 



1 We note that 1 fm = 10 ^'^ cm is approximately the size of a proton. 

2 One usually considers units such that h = c = I, where h is the Plank 
constant and c is the speed of light in vacuum. 



g = go^ + ia-g 



go + igs 92 + igi 
-.92 + -igi 90 - igs 



, (5) 



where the components of ct = (cti,CT2,(T3) are the three 
Pauli matrices 1311 and • stands for scalar product. Then, 
the adjoint of a matrix g G SU{2) is given by 



golL-ia ■ g. 



(6) 



Also, note that the unitarity condition det g = I (where det 
indicates the determinant of a matrix) implies + gf + 



si + 53 = 1' namely an SU{2) matrix can be consid- 
ered as a four-dimensional unit vector For the gluon field 
Unix) S SU{2) one usually writes 

(x) = Ao.p {x)l. + ia- (x) . (7) 

Our goal is to evaluate numerically the gluon and ghost 
propagators D{k) and G{k), defined in Section l23] [see eqs. 
J28t and J29ll1. To this end one needs to (i) produce a ther- 
malized configuration {Uf^ {x)} by Monte Carlo simulation, 
(ii) gauge fix this configuration, (iii) evaluate the propaga- 
tors using the gauge-fixed configuration. These steps are de- 
scribed in detail in Sections [2. 1112. 21 1231 and are schemati- 
cally represented in the code below: 

main ( ) 
{ 

/* set parameters: beta, number of 
configurations NC, number of 
thermalization sweeps NT, etc. */ 
read_parameters () ; 
/* {U} is the link configuration */ 
set_initial_conf iguration (U) ; 

for (int c=0; c < NC; C++) { 
thermalize (U, NT) ; 
gauge_f ix (U, g) ; 
evaluate_propagators (U, D, G) ; 

} 

} 

Note that the gauge-fixing step consists in finding a 
gauge transformation {g{x)} [see eq. @] leading to a given 
gauge condition. In our case, since we are interested in Gri- 
bov's predictions, we employ the so-called Landau gauge. 

The general setup of our simulations is described in Sec- 
tionlU 

2.1. Thermalization 

In (dynamic) Monte Carlo simulations [22] the weighted 
configuration-space average defined in Q is substituted by 
a time average over successive realizations (i.e. configura- 
tions) of the considered system, which evolves according to 
a Markov process in the so-called Monte Carlo time. Usu- 
ally, the system is updated by sweeping over all sites of the 
lattice and generating a new value for the field at each site 
based on the conditional probability distribution obtained 
by keeping all other field variables fixed. In the heat-bath 
update, for example, this single-site distribution is sampled 
exactly. In QCD simulations one considers only effectively 
independent field configurations. This means that one fol- 
lows the system's evolution for a large enough number of 
time steps such that a statistically independent new config- 
uration is generated, discarding the intermediate steps. Per- 



forming the Monte Carlo iterations to obtain such indepen- 
dent field configurations is called thermalization. 

To thermalize the fields {U^{x)} we use a standard heat- 
bath algorithm 1 6 1 accelerated by hybrid overre taxation (Q . 
This corresponds to doing n micro-canonical (or energy- 
conserving) update sweeps over the lattice, followed by one 
local ergodic update (a heat-bath sweep). As explained be- 
low, the micro-canonical sweeps are important for a more 
efficient sampling of the configuration space. For the heat- 
bath update, one considers the contribution of a single link 
variable U^{x) to the Wilson action (|2ji. This single-link ac- 
tion is given by 

5sL = - ^ Tr [ U^{x) H^{x) ] + constant , (8) 
where the "effective magnetic field" H^{x) is defined as 

H^{x) = [ U.{x + e^) U;:\x + e,) U^H^) 

+ U-\x ~ + e^,)U-\x - e,)U,{x - e,)] . (9) 

Since the matrix H^{x) is proportional to an SU (2) matrix, 
we can write it as 

H^{x)=N^{x)H^,{x), (10) 

with H^,{x) e SU{2) and7V,,(a;) ee ^/detH^Jx). Then, 
by using the invariance of the group measure under group 
multiplication, one obtains the heat-bath update ||6l 

u^{x) ^ V h;\x) , (11) 

where the SU (2) matrix V = vq^ + ia ■ v must be gener- 
ated by choosing vq according to the distribution 

\/ 1 - Wo exp[/37V^(a;)wo]dwo (12) 

and the vector v (which is normalized to ^1 — Vq ) point- 
ing along a uniformly chosen random direction in three- 
dimensional space. 

The vector v can be easily generated. For example, if we 
use cylindrical coordinates we may take 

VI = ^ {1 - p^) (1 ~ vi) COS0 (13) 
V2 = y^(l-p2)(l_i,2) sin0 (14) 

^^3 = \/l-VoP (15) 

with p uniformly distributed in [—1, 1] and (j> uniformly dis- 
tributed in [0, 2 tt] . On the contrary, the problem of gener- 
ating Vq according to the distribution (I12> is considerably 
more involved. Three different rejection methods for this 
purpose are considered in |17 Appendix A]. Here, we use 
their algorithms called method 1 and method 2, with a cut- 
off value of 2.0 for the quantity (3N^{x), namely method 1 



is used when pj\ff^{x) < 2.0 and method 2 is used other- 
wise. 

In order to implement the heat-bath method one also 
needs a (parallelized) random number generator. Here we 
adopt the RANLUX generator, which is based on chaos the- 
ory 1 24 1 . More precisely, we use a double-precision imple- 
mentation of RANLUX (version 2.1) with luxury level set to 
2. 

Let us now consider the (deterministic) micro-canonical 
update, used in the hybrid overrelaxed algorithm: 

U^{x) ^ H-\x)Tv [U^^ix) H^{x)^ - f/^(x). (16) 

From formulae (|8j and ( I10> it is easy to see that this update 
does not change the value of the action Ssl- On the other 
hand, the step in (I16> represents a large move in configu- 
ration space Thus, one can alternate micro-canonical 
sweeps of the lattice and heat-bath updates in order to re- 
duce the problem of critical slowing-down, which afflicts 
Monte Carlo simulations of critical phenomena |30|. The 
efficiency of the hybrid overrelaxed algorithm may be op- 
timized by tuning the value of n, i.e. the number of micro- 
canonical sweeps between two heat-bath sweeps. 

2.2. Landau gauge fixing 

For a given thermalized lattice configuration {[/^(x)}. 
Landau gauge fixing is obtained by looking for a gauge 
transformation {^g{x) G SU (2)} that brings the functional 



^^Tr[5(a;) (74a;)5t(a; + e^)] (17) 



to a local minimum, starting from randomly chosen {(7(2;)} 
II18I . Thus, from the numerical point of view, fixing the lat- 
tice Landau gauge is a minimization problem. Here, we 
consider three different (iterative) gauge-fixing algorithms 
lis] 121: the so-called Cornell (COR) method, the stochas- 
tic overrelaxation (SOR) algorithm and the Fourier accel- 
eration (FA) algorithm. Let us notice that the first two al- 
gorithms are based on local updates for the matrices g{x) 
and have dynamic critical exponent z = 1. This means that 
the number of iterations required to achieve a given accu- 
racy in the minimization of the functional Ejj [g] grows as a 
function of the lattice side N as A^'^+^, when considering a 
symmetric lattice in d dimensions. On the other hand, the 
FA algorithm is based on a global update and has z = [at 
least for sufficiently smooth field configurations {?7^(x)}]. 
Its computational work grows roughly as N''-. Thus, even 
though the CPU time necessary to update a single-site vari- 
able g{x) is much smaller for the two local methods than for 
the FA method, the latter should clearly be used when con- 
sidering very large values of N . 



The update g{x) gnew{x) for the COR and the SOR 
methods can be written in terms of local quantities — i.e. 
quantities defined only in terms of the site x — and of the 
matrix 



h{x) = J2 g\x + Bf,) 

+ Ul{x - ef,)g\x - e^)] 



(18) 



On the contrary, for the FA method one needs to evaluate 



u{x) = ( —A ) ^ w (x) 



(19) 



where w(x) = g{x)h{x), the three-dimensional vector w is 
given by [see eq. (|5}] 



2i(T ■ w(a;) = w(.t) — w^(a:) 



(20) 



and —A is (minus) the lattice Laplacian, defined for a gen- 
eral vector field f{x) as 



a 

(-A/)(x)^^[2/(a:)-/(a 



(21) 

From eq. ( I19> it is evident that the FA method is actually a 
Laplacian preconditioning algorithm. 

Traditionally the inversion of the lattice Laplacian is 
done using a fast Fourier transform (FFT), after writing 



(-A)- 



P 



(22) 



where F indicates the Fourier transform, F^^ is its inverse 
and is the squared magnitude of the lattice momentum. 
Alternatively |10|, the inversion of the Laplacian may be 
done using a multigrid (MG) algorithm or a conjugate gra- 
dient (CG) method, avoiding the use of the FFT, which has 
high communication costs in a parallel implementation. In- 
deed, one obtains IIOIII II the same convergence as for the 
original algorithm (based on FFT), when using an accuracy 
of about lO^'^ for the MG or CG (iterative) inversion. At 
the same time, the computational cost of the new imple- 
mentations is smaller than that of the FFT-FA method when 
considering large lattice volumes. This is true even for a 
non-parallelized code. Moreover, the MG-FA and CG-FA 
algorithms are well suited for vector and parallel machines 
and they make the FA method more flexible, i.e. it works 
equally well with any lattice side.^ Here we do the inver- 
sion of the Laplacian using a CG method preconditioned 
with red/black ordering 1 16 1. As stopping criterion we con- 
sider Tt/ro < 10^'^, where rt is the magnitude of the CG 



3 We note that the FF T is sUghtly less eiBcient for lattice sides A'^ that 
are not powers of 2 f34fl. 



residual after t iterations. We note that, with this stopping 
criterion, one can do the inversion in single precision, even 
though the rest of the code is written in double precision. 
This corresponds to a speed-up of almost a factor 2 in the 
inversion. 

Let us stress that the three algorithms considered above 
require the tuning of a free parameter in order to attenu- 
ate critical slowing-down, or equivalently in order to reduce 
the computational work. Notice however that the CPU time 
necessary to update a single-site variable g{x) is essentially 
independent of the value of the tuning parameter 

2.2.1. Convergence of the gauge fixing Several quanti- 
ties have been introduced in order to check the convergence 
of Landau-gauge-fixing algorithms j8| . We consider here 



(23) 



which is commonly used in numerical simulations, and 



3 



= wE^E E 



^=1 



T=: 1 2 



(24) 

which provides a very sensitive test of the goodness of the 
gauge fixing. Let us recall that 

d 

(V • Ab) (x) = J2 [AbA^) - Ab,^{x - e^)] (25) 

is the lattice divergence of Ab^^i{x^ [see eq. Q]. We also 
define 



1 



(26) 



where the quantities 



Qh^Xy) = E E ^ = 1, . . . ,d 

(27) 

are constant (i.e. independent of a;^;) if the Landau-gauge- 
fixing condition is satisfied. The two quantities (VA)^ and 
Eq are expected to converge to zero exponentially (and 
with the same exponent) as a function of the number of 
gauge-fixing sweeps, even though their sizes may differ 
considerably. 

2.3. Evaluation of the propagators 

A propagator of a field is a two-point function, i.e. a cor- 
relation function between values of the field at two different 
points in space-time |23 1. In quantum mechanics, the prop- 
agator determines the evolution of the wave function of a 
system and, for a particle, it gives the probability amplitude 



of going (i.e. propagating) from a point in space-time to an- 
other |29|. More generally. Green's functions (i.e. n-point 
functions) carry all the information about the physical and 
mathematical structure of a quantum field theory. From this 
point of view, two-point functions (propagators) are a the- 
ory's most basic quantities and the gluon propagator may 
be thought of as the most basic quantity of QCD. The ghost 
propagator appears in the theory as a consequence of the 
gauge-fixing procedure described above. 

The gluon propagator is conveniently defined in momen- 
tum space as 



D{k) « E ^ I E ^f>,^L{x) exp {2'Kik ■ x) |^ 



(28) 



Here, b goes from 1 to 3 [when considering the SU{2) 
gauge group], /i goes from 1 to d (d = 3 for three- 
dimensional space-time), k has components k^ taking val- 
ues fcp Nf^ = 0, 1, . . . , — 1 and the field Af,^^(x) is de- 
fined in eq. 0. Note that • stands for scalar product and 
... I indicates the norm of a complex number. 
The numerical evaluation of the ghost propagator (in mo- 
mentum space) is considerably more involved. In fact, one 
has to calculate 



Gik)^J2' 



-k-(x-y) 



J2{{M-')^,ix,y-U)), (29) 



where the matrix Mab{x, y, U) is a sparse matrix that de- 
pends on the gluon field U^{x). (For an explicit definition 
of this matrix see eq. (B.18) in 1331 .') Note that, since the 
color indices a and b go from 1 to 3 [for the SU (2) group] 
and if there are N'^ lattice sites, the size of this matrix is 
SiV'' X 3A^''. 

The inversion of the matrix Aiab{x,y', U) can be done 
using a CG method with red/black ordering, as in the case 
of the lattice Laplacian considered above for gauge fixing. 
This part of our code has not been parallelized yet, but we 
expect to obtain a speedup comparable to the one obtained 
for the lattice-Laplacian case. 

2.4. Parallelization 

As said above, we need a parallelized code in order to 
simulate at very large lattice sizes. We have started by con- 
sidering the QCDMPI package |35|, which is based on the 
work of Hioki |20|. The advantages of this package are its 
portability and the efficient way of evaluating the effective 
magnetic field [x) — also called the staple — defined in 
eq. (|9}. In particular, the extra memory space required for 
communication is considerably reduced with respect to pre- 
vious implementations. The original QCDMPI code is writ- 
ten for pure 5[/(3) lattice gauge theory in d dimensions 
(d > 2) and performs only the (heat-bath) thermalization 
step of the simulation. We have adapted the original code 



to the SU{2) case and improved the generation of vq ac- 
cording to the distribution (I12> . More precisely (see Sec- 
tion|3A), we have added method 1 and a more efficient ver- 
sion of method 2. At the same time, we have introduced the 
micro-canonical step, the various Landau-gauge-fixing al- 
gorithms discussed in Section|2]B (i.e. the COR, SOR and 
CG-FA methods), the calculation of the quantities (VA)^ 
and Sq for checking the convergence of the gauge fixing, 
and the evaluation of the gluon propagator (As mentioned 
above, we have not yet parallelized the evaluation of the 
ghost propagator) 

For the parallelization, we divide the lattice equally 
among the nodes, i.e. we place v = V/M sites of the lat- 
tice in each node, where V is the lattice volume and we use 
M nodes. Each node gets a contiguous block of lattice sites. 
We will refer to this block of sites as the local lattice in a 
node. Note that, in general, not all directions /i of the lat- 
tice are divided between different nodes and that, in order 
to use a red/black ordering, the number of sites v in each 
node must be even. 

Let us stress that communication is required for the eval- 
uation of the staple i7p(a;), for the calculation of h{x) [see 
eq. (I18H and (in the CG-FA method) for the inversion of 
the lattice Laplacian [see eq. (I21H . Also, the evaluation of 
the quantities (VA)^ and (see Section IJlB.l) requires 
some level of communication in a parallel code, while for 
the gluon propagator [see eq. ( I28H one has to perform only 
a sum over the whole lattice. Since the expressions to be 
parallelized involve at most quantities at lattice sites that are 
nearest neighbors, communication is required only for sites 
on the boundary of the local lattice in a node. Moreover, 
simulations are usually done in three or four dimensions, 
leading to a high granularity due to the surface/volume ef- 
fect. 

All communications in [201 are carried out using just two 
subroutines. The first (called set link) sends data from a 
node to the previous one in a given direction. The second 
subroutine (called slidematrix) sends data from a node 
to the next one (in a given direction). Clearly, these two rou- 
tines are all we need in order to perform the communica- 
tions required in our code. 



3. Performance 

As mentioned in the Introduction, our simulations were 
done on a PC cluster at the IFSC-USP. The system has 16 
nodes and a server, all with 866 MHz Pentium III CPU. The 
nodes have 256 MB RAM memory (working at 133 MHz) 
and the operating system is Debian GNU/Linux (ver- 
sion 3.0rO). The machines are connected with a 100 Mbps 
full-duplex network through a 3COM switch. All user direc- 
tories are located on the server, which has two SCSI disks. 



and are mounted by the nodes (using NFS). The server is 
not used for the computations. 

Our code (as well as the QCDMPI package) is writ- 
ten in FORTRAN 7 7 making use of MP I for communica- 
tion. The code may be run for a general lattice dimension 
d > 2. As said before, we consider here d — 3. We use 
MPICH (version 1.2.1-16) and the compiler g7 7 (version 
0.5.24). The compilation has been done with the follow- 
ing four options -march=pentiumpro -fomit-frame-pointer 
-mpreferred-stack-boundary=2 -03 . 



Table 1: Average CPU-time (in /is) to update a link vari- 
able U^{x) using heat-bath {thb) or micro-canonical update 



M 


Node topol. 


thb 




1 


1x1x1 


10.341(4) 


6.0190(1) 


2 


2x1x1 


5.6(2) 


3.2099(4) 


4 


2x2x1 


2.78(2) 


1.6958(3) 


4 


4x1x1 


2.898(6) 


1.817(3) 


8 


2x2x2 


1.435(7) 


0.8881(3) 


8 


4x2x1 


1.48(2) 


0.9125(6) 


8 


8x1x1 


1.9(1) 


1.236(4) 


16 


4x2x2 


0.758(7) 


0.4732(3) 


16 


4x4x1 


0.75(1) 


0.4614(4) 


16 


8x2x1 


0.849(8) 


0.5677(3) 


16 


16 X 1 X 1 


1.25(1) 


0.6735(5) 



We now describe the setup of a complete simulation. 
As said in Section |2]A, we generate statistically indepen- 
dent field configurations to be used for the evaluation of the 
Monte Carlo average of an observable, which provides an 
estimate of the average in eq. ([ij. In order to reduce the sta- 
tistical error (i.e. the Monte Carlo error) of this estimate, 
one typically needs to produce hundreds of such configura- 
tions. For each of them there are two computational steps 
involved: the Monte Carlo generation of a new indepen- 
dent field configuration and the evaluation of the desired ob- 
servables. The first step is the thermalization, which in our 
case is done using the hybrid overrelaxed algorithm. This 
usually requires hundreds of sweeps of the lattice, corre- 
sponding to several hours to produce a new configuration. 
The second step is often even more time-consuming. In our 
case the evaluation of the observables (i.e. the propagators) 
is done only after the new configuration has been gauge- 
fixed, by an iterative minimization procedure. One usually 
needs thousands of iterations in order to reach a prescribed 
accuracy (for example Sg < 10^^^, required for a good 
quality of the gauge fixing). The actual computation of the 
gluon propagator requires a negligible time. (This is not true 
for the ghost propagator) Consequently, for typical values 



of lattice volumes, a complete simulation may take several 
months. For example, in order to produce the (preliminary) 
data reported in 1 12 1, the code was running on our PC clus- 
ter for almost three months. The production of the corre- 
sponding final results is expected to take almost one year of 
runs. As an illustration, for V — 140^ and [3 — 6.0 the aver- 
age CPU-times (per configuration) using 4 nodes are: about 
8 hours for thermalization and about 21 hours for gauge fix- 
ing (using COR or SOR methods). We note that the total 
CPU-time per configuration is not appreciably affected by 
changes in the parameter f3. [More precisely, for smaller (re- 
spectively larger) values of f3 one spends a little less (resp. 
more) time for thermalization and a little more (resp. less) 
time for gauge fixing.] We plan to use the CG-FA method 
in future production runs, to reduce the percentage of time 
spent in gauge fixing. 

We report below on some runs performed for testing the 
speedup and efficiency of our code. 



Table 2: Average CPU-time (in /is) to update a site variable 
g{x) using gauge-fixing methods COR (tcor), SOR (tstoc) 
or CG-FA {teg). Errors are one standard deviation. 



M 


Node topol. 


icor 


^stoc 


teg 


1 


1x1x1 


5.606(2) 


6.272(2) 


253(1) 


2 


2x1x1 


2.9659(7) 


3.295(1) 


136.2(2) 


4 


2x2x1 


1.560(1) 


1.7232(5) 


70.9(3) 


4 


4x1x1 


1.6063(5) 


1.789(1) 


73.6(3) 


8 


2x2x2 


0.819(1) 


0.9011(4) 


37.5(1) 


8 


4x2x1 


0.833(1) 


0.9175(7) 


37.2(1) 


8 


8x1x1 


1.0228(5) 


1.1133(4) 


43.5(1) 


16 


4x2x2 


0.4383(3) 


0.4771(2) 


18.99(5) 


16 


4x4x1 


0.4315(4) 


0.4706(1) 


18.43(5) 


16 


8x2x1 


0.4870(6) 


0.5296(2) 


19.22(6) 


16 


16 X 1 X 1 


0.5924(4) 


0.644(3) 


27.29(5) 



3.1. Speedup at fixed volume 



We did tests for the various algorithms, considering a 
fixed lattice volume V — 64^, using 1, 2, 4, 8 and 16 nodes. 
Let us note that this lattice size is relatively small. In fact, 
it can be simulated on a single node (without paralleliza- 
tion) using less than 20% of the memory (about 24% if one 
employs the CG-FA method for gauge fixing). Thus, since 
communications are proportional to the surface area of the 
local lattice in a node, these tests correspond to a worse sit- 
uation than the one we considered for our production runs 
infT2l. 

In Tables Q and |2l we report the average CPU-time (in 
micro-seconds) necessary to update a link variable U^{x) 
using a heat-bath {thb) or a micro-canonical update {tmc), 
and the time to update a site variable g{x) using the gauge- 
fixing methods COR (tcor), SOR (tstoc) or CG-FA (t^g). 
These CPU-times are given for different values of the num- 
ber of nodes M and different (three-dimensional) node 
topology. 

3.2. Speedup at variable volume 

We also did tests at variable volume, considering five dif- 
ferent node topologies: 1x1x1, 1x1x2, 1x2x2, 
2x2x2 and 2x2x4, corresponding respectively to 
M — 1, 2, 4. 8 and 16 nodes. For each node topology we 
have simulated using three different lattice volumes V . Re- 
sults of these tests are reported in Tables |3] and |4] for dif- 
ferent numbers of nodes M and for the various lattice vol- 
umes. As can be seen from the first two columns in these ta- 
bles, the lattice volumes have been chosen so that the local 
lattice volume v ~ V/M is always given by one of the fol- 
lowing cases: 4^, 16"^ and 64^. Let us note that this arrange- 



ment is closer to what is usually considered for production 
runs. In fact, when carrying out parallel simulations at in- 
creasingly large lattice volumes on a PC cluster, it is prefer- 
able to fill up the memory in each node before increasing 
the number of nodes. (This reduces the percentage of time 
spent in communication.) 



Table 3: Average CPU-time (in /is) to update a link vari- 
able U^{x) using heat-bath {thb) or micro-canonical update 



M 


V 


thb 


^rnc 


1 


43 


4.76(1) 


1.028(2) 


1 


16^ 


8.05(6) 


4.1821(9) 


1 


643 


10.383(5) 


6.0186(1) 


2 


4^ X 8 


9.19(2) 


7.08(1) 


2 


16^ X 32 


4.9(1) 


2.7561(9) 


2 


642 X 128 


5.39(7) 


3.1184(2) 


4 


4 X 8^ 


8.1(2) 


7.4(3) 


4 


16 X 322 


2.708(9) 


1.6940(9) 


4 


64 X 128^ 


2.768(3) 


1.6175(3) 


8 


83 


5.94(8) 


5.02(6) 


8 


323 


1.54(2) 


1.0090(2) 


8 


1283 


1.55(8) 


0.8447(3) 


16 


8^ X 16 


5.8(6) 


4.4(2) 


16 


32^ X 64 


0.763(2) 


0.5111(3) 


16 


128^ X 256 


0.77(1) 


0.436(1) 



Table 4: Average CPU-time (in /is) to update a site variable 
g{x) using gauge-fixing methods COR (tcor), SOR itstoc) 
or CG-FA {teg)- Errors are one standard deviation. 



M 


V 


^cor 


^stoc 


teg 


1 


4^ 


1.21(7) 


1.60(1) 


4.26(3) 


1 


16^ 


3.90(1) 


4.224(6) 


19.82(4 


1 


64^ 


5.606(2) 


6.272(2) 


253(1) 


2 


4^ X 8 


6.41(8) 


6.79(4) 


117.2(6) 


2 


W X 32 


2.413(5) 


2.679(4) 


31.38(6) 


2 


642 X 128 


2.885(1) 


3.2329(4) 


122(3) 


4 


4 X 8^ 


6.2(2) 


6.10(6) 


125.1(2) 


4 


16 X 32^ 


1.497(3) 


1.571(3) 


26.47(5) 


4 


64 X 128^ 


1.4935(2) 


1.6634(3) 


63.7(7) 


8 


8^ 


4.53(6) 


4.64(5) 


96.1(9) 


8 


32''' 


0.862(2) 


0.927(1) 


18.83(8) 


8 


128^ 


0.7620(2) 


0.8528(1) 


31.5(2) 


16 


8^ X 16 


3.56(7) 


4.4(2) 


89.6(7) 


16 


32^ X 64 


0.4496(7) 


0.480(1) 


14.25(4) 


16 


128^ X 256 


0.3942(3) 


0.4385(2) 


17.6(5) 



4. Results and conclusions 

The CPU-times reported above indicate that the paral- 
lelization is rather good for the five algorithms considered 
(the heat-bath and micro-canonical methods for thermaliza- 
tion, and the COR, SOR and CG-FA methods for gauge fix- 
ing). Also, the values for the speedup S = ti/tM (and the 
efficiency E = S/M) are very similar for the five cases. 
We therefore take averages over the five methods and report 
them in Table|5l In the fixed-volume case we present results 
for node topologies 2x2x1, 2x2x2 and 4x4x1 respec- 
tively for the cases M = 4, 8 and 16. (This corresponds to 
the best performance for a given number of nodes.) In the 
variable- volume case we consider results obtained using the 
largest lattice volume for each node topology. We clearly 
see that one obtains a good parallelization even in the fixed- 
volume case and using a relatively small lattice volume V . 
Indeed, the efficiency decreases slowly when doubling the 
number of nodes and it is still about 0.84 at M — 16. As 
expected, the performance is better for the variable- volume 
case when considering large values of the local lattice vol- 
ume ti in a node. As said above, this test is closer to the sit- 
uation usually considered in production runs, with local lat- 
tice sizes using more than 50% of the memory of each node. 

The data for the speedup at variable volume reported in 
Tableware well fitted by the function S{M) = A/ (1 - 
c log Af) with c = 0.038 ± 0.001. The corresponding plot 
is shown in Figure^ This would mean a speedup of almost 
400 for 512 nodes. 

Note that the efficiency loss in going from 1 to M nodes 



Table 5: Average speedup and efficiency. 





Fixed volume 


Variable volume 


S 


E 


S 


E 


1 - 


2 


1.87(2) 


0.937(8) 


1.96(2) 


0.981(8) 


1 - 


4 


3.61(1) 


0.904(3) 


3.79(1) 


0.948(3) 


1 - 


8 


6.91(2) 


0.863(2) 


7.31(8) 


0.91(1) 


1 - 


> 16 


13.38(6) 


0.836(3) 


14.5(3) 


0.88(1) 



at variable volume V and small v is due to the effect of node 
communication over the usage of the cache memory. 

We conclude that the parallelization of our code works 
very well and that simulations of this type are very viable 
on a PC cluster. Let us point out again the main results 
of the parallel implementation described in this paper We 
have adapted and extended the package QCDMPI, which 
shows good parallelization but performs only thermaliza- 
tion of the gluon fields. The resulting code employs a more 
efficient thermalization algorithm and also performs gauge 
fixing and evaluation of propagators, with essentially the 
same parallel performance as the original package. 

For the future we plan to (i) parallelize the codes for 
the evaluation of the ghost propagator and for the MG-FA 
gauge-fixing method, (ii) improve the usage of the cache 
memory and (iii) introduce overlap of computation and 
communication. 



Figure 1 : Code speedup S at variable volume as a function 
of the number of nodes M. 
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